Introduction

GitHub repository containing the notebook and important plots:

Problem

In this project we are first presented the Franke function:
$$ f_1(x,y) = 0.75e^{-0.25(9x-2)^2-0.25(9y-2)^2} \\ f_2(x,y) = 0.75e^{-\frac{(9x+1)^2}{49.0}-0.1(9y+1)} \\ f_3(x,y) = 0.5e^{-\frac{(9x-7)^2}{4.0}-0.25(9y-3)^2} \\ f_4(x,y) = -0.2e^{-(9x-4)^2-(9y-7)^2} \\ f(x,y) = f_1(x,y)+f_2(x,y)+f_3(x,y)+f_4(x,y) $$

where $f(x,y)$ is the Franke function, on the intervals $x\in[0,1]$ and $y\in[0,1]$.
To properly show this surface, a meshgrid of values between zero and one was made, and the Franke function evaluated at all points.
In exercises 1-5 we explore three regression methods: ordinary least squares (henceforth referred to as OLS), Ridge, and Lasso, applied to a training data set generated from the Franke function, to see if we can model the data in the function.
More precisely, we assume that the data is given by
$$ z = f(x,y) + \epsilon, $$ where we assume that $\epsilon \sim \mathcal{N}(0,\sigma^2)$ is normally distributed with zero mean and a standard deviation of $\sigma$.
We are trying to make a polynomial fit to this data, with maximal degree $k$: $$ \tilde{f}(x,y) = \sum_{i=0}^k\sum_{j=0}^i \beta_{n}x^{i-j}y^j, $$ where $n\in\{0, 1, \dots, \frac{(k+1)(k+2)}{2}-1\}$ and the $\beta_n$'s are the weights. This can be rewritten as $$ \tilde{f}(\mathbf{m}) = \mathbf{m}^T\mathbf{\beta}. $$ Here $\mathbf{m}$ contains all the terms of the polynomial: $$ \mathbf{m}^T = [1, x, y, x^2, xy, y^2, \dots, xy^{k-1}, y^k], $$ and $\mathbf{\beta}$ the corresponding weights: $$ \mathbf{\beta}^T = [\beta_0, \beta_1, \dots, \beta_p], $$ where $p = \frac{(k+1)(k+2)}{2}-1$.

Making our design matrix

We use $N_x$ values for $x$ and $N_y$ values for $y$. The design matrix is made by taking every combination of arguments in our dataset and inserting it into $\mathbf{m}^T$. Since our data lie on a meshgrid, every value of $x$ is paired with $N_y$ values of $y$, and every value of $y$ with $N_x$ values of $x$. Every such pair of values produces one row of the design matrix, which therefore has dimensions $(N_xN_y)\times (p+1)$. The design matrix looks like this: $$ X = \begin{bmatrix} 1 & x_1 & y_1 & x_1^2 & \dots & x_1y_1^{k-1} & y_1^{k} \\ 1 & x_1 & y_2 & x_1^2 & \dots & x_1y_2^{k-1} & y_2^{k} \\ 1 & x_1 & y_3 & x_1^2 & \dots & x_1y_3^{k-1} & y_3^{k} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{N_x} & y_{N_y-2} & x_{N_x}^2 & \dots & x_{N_x}y_{N_y-2}^{k-1} & y_{N_y-2}^k \\ 1 & x_{N_x} & y_{N_y-1} & x_{N_x}^2 & \dots & x_{N_x}y_{N_y-1}^{k-1} & y_{N_y-1}^k \\ 1 & x_{N_x} & y_{N_y} & x_{N_x}^2 & \dots & x_{N_x}y_{N_y}^{k-1} & y_{N_y}^k \\ \end{bmatrix} $$
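This construction can be sketched in Python with NumPy. This is a minimal sketch following the description above; the function name and the grid size in the example are our own choices, not necessarily the notebook's actual implementation:

```python
import numpy as np

def create_design_matrix(x, y, k):
    """Design matrix with columns 1, x, y, x^2, xy, y^2, ..., y^k
    for flattened meshgrid arrays x, y and maximal polynomial degree k."""
    x = x.ravel()
    y = y.ravel()
    p = (k + 1) * (k + 2) // 2          # number of columns, p + 1 in the text
    X = np.ones((x.size, p))
    col = 1
    for i in range(1, k + 1):           # total degree of the term
        for j in range(i + 1):          # power of y within that degree
            X[:, col] = x**(i - j) * y**j
            col += 1
    return X

# Example: a 3x3 meshgrid (N_x = N_y = 3) and degree k = 2 gives a 9 x 6 matrix
xg, yg = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 3))
X = create_design_matrix(xg, yg, k=2)
```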

Why do we make this matrix?

We want $\tilde{f}(\mathbf{X})$ to be as close to $z$ as possible, even for data we haven't trained on. In this project, a dataset $z$ is first created, and a design matrix is built from the inputs to the function that created the dataset. The dataset and the design matrix are then split into a training set and a test set (ratio = 4:1), with the members of each set picked at random. We then find the values of $\beta$ that, based on the training set, create an $\tilde{f}$ as close to the dataset as possible.
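The random 4:1 split can be sketched with plain NumPy indexing; the seed, the array names, and the number of data points below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2021)      # seed chosen only for reproducibility

N = 100                                # hypothetical number of data points
indices = rng.permutation(N)           # random ordering of the data points
split = int(0.8 * N)                   # 4:1 train/test ratio
train_idx, test_idx = indices[:split], indices[split:]
# z_train, z_test = z[train_idx], z[test_idx]; the same rows are taken from X
```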

How do we measure proximity to dataset?

Here is where the cost function comes in. The cost function calculates the squared distance between the points in the dataset and the fitted function: $$ C(\beta) = \sum_{i=0}^{N-1}(z_i-\tilde{f}_i)^2, $$ where $N$ is the number of values in $z$ we are calculating the proximity to. To minimize this function, we take the derivative with respect to $\beta$ and set it equal to zero:
$$ \frac{dC}{d\mathbf{\beta}} = \frac{d}{d\mathbf{\beta}}(\mathbf{z}^T\mathbf{z}-2\mathbf{\beta}^TX^T\mathbf{z}+ \mathbf{\beta}^TX^TX\mathbf{\beta}) = 0, $$ which yields our favourite ordinary least squares $\beta$: $$ \beta = (X^TX)^{-1}X^T\mathbf{z}. $$ The $\beta$ calculated from the training set of the design matrix and the training set of $z$ is then applied to predict the values in the test set of $z$, using $\tilde{f}$: $$ \tilde{f} = X_{test}\beta. $$

Methods used

Ordinary least squares

Above, the cost function is what in mathematics is called the squared $L_2$-norm of the difference between the two vectors $\mathbf{z}$ (dataset) and $X\mathbf{\beta}$ (model), and we seek to minimize this norm. This leads to the ordinary least squares method, written mathematically as: $$ \mathbf{\beta}_{OLS} = \arg\min_{\mathbf{\beta}\in\mathbb{R}^p}||\mathbf{z}-X\mathbf{\beta}||_2^2, $$ where we calculate $\mathbf{\beta}_{OLS}$ using the training set of $z$ and $X$, and then apply it to our test set to see if it is a good predictor of the data behaviour. As we saw above, this leads to $\mathbf{\beta}$ being equal to: $$ \mathbf{\beta}_{OLS} = (X^TX)^{-1}X^T\mathbf{z}. $$
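The closed-form solution can be written directly in NumPy. A minimal sketch (using `pinv` rather than a plain inverse is a defensive choice of ours, guarding against a near-singular $X^TX$ at high polynomial degrees):

```python
import numpy as np

def ols_beta(X, z):
    """OLS solution beta = (X^T X)^{-1} X^T z, via the pseudo-inverse."""
    return np.linalg.pinv(X.T @ X) @ X.T @ z

# Small check on noise-free data z = 1 + 2x: OLS should recover the weights
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
z = 1.0 + 2.0 * x
beta = ols_beta(X, z)   # close to [1, 2]
```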

Ridge regression

The Ridge regression method adds a regularization parameter $\lambda$, which multiplies the squared $L_2$-norm of $\mathbf{\beta}$. This term is added to the squared $L_2$-norm of the difference between the data points and the design matrix multiplied with $\mathbf{\beta}$:
$$ \mathbf{\beta}_{Ridge} = \arg\min_{\mathbf{\beta}\in\mathbb{R}^p}||\mathbf{z}-X\mathbf{\beta}||_2^2+\lambda||\mathbf{\beta}||_2^2. $$ To minimize this function we again differentiate with respect to $\beta$ and set the result equal to 0. This is very similar to the OLS method, but now we have an additional term: $$ \frac{dC(\mathbf{\beta})}{d\mathbf{\beta}} = \frac{d}{d\mathbf{\beta}}(\mathbf{z}^T\mathbf{z}-2\mathbf{\beta}^TX^T\mathbf{z}+ \mathbf{\beta}^TX^TX\mathbf{\beta} + \lambda\mathbf{\beta}^T\mathbf{\beta}) = 0, $$ which further becomes: $$ X^T\mathbf{z} = (X^TX+\lambda I)\mathbf{\beta}, $$ which in turn yields $\mathbf{\beta}_{Ridge}$: $$ \mathbf{\beta}_{Ridge} = (X^TX+\lambda I)^{-1}X^T\mathbf{z}. $$ This is applied in exactly the same manner as OLS regression: use the training sets to find $\mathbf{\beta}_{Ridge}$, then use it on $X_{test}$ to find a prediction for $\mathbf{z}_{test}$.
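A minimal sketch of the Ridge estimator above; the variable names and the check data are our own:

```python
import numpy as np

def ridge_beta(X, z, lam):
    """Ridge solution beta = (X^T X + lambda I)^{-1} X^T z."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ z)

# As lambda -> 0 the Ridge solution approaches the OLS solution
x = np.linspace(0, 1, 20)
X = np.column_stack([np.ones_like(x), x])
z = 1.0 + 2.0 * x
beta = ridge_beta(X, z, 1e-10)   # close to the OLS weights [1, 2]
```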

LASSO regression

LASSO is an acronym for Least Absolute Shrinkage and Selection Operator. It also uses a regularization parameter, but this time it multiplies the $L_1$-norm of $\mathbf{\beta}$, so the expression we minimize to find the optimal $\mathbf{\beta}$ becomes: $$ \mathbf{\beta}_{Lasso} = \arg\min_{\mathbf{\beta}\in\mathbb{R}^p}||\mathbf{z}-X\mathbf{\beta}||_2^2+\lambda||\mathbf{\beta}||_1. $$ The $L_1$-norm is simply the sum of the absolute values of the parameters in $\mathbf{\beta}$. Because it is not differentiable everywhere, there is no closed-form solution; one can make conditional statements about the behaviour of this function, but we won't go into that here. We used the module scikit-learn, which has the LASSO function built in, when we wanted to perform LASSO regression.
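With scikit-learn, a LASSO fit looks like the sketch below. The synthetic data and the value of `alpha` (scikit-learn's name for $\lambda$) are our own illustrative choices, not the parameters used in the notebook:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
# only the first two features carry signal; the rest should be shrunk away
z = X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.standard_normal(50)

model = Lasso(alpha=0.1, max_iter=10_000)   # alpha plays the role of lambda
model.fit(X, z)
beta_lasso = model.coef_   # L1 shrinkage sets small weights exactly to zero
```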

Measuring error between predictions and dataset

We will use something similar to the cost function itself to measure error, namely the mean-squared error. The mean-squared error takes the distance between the values of the dataset and the prediction, squares it, then divides by the number of datapoints: $$ MSE(z, \tilde{f}) = \frac{1}{N}\sum_{i=0}^{N-1} (z_i-\tilde{f}_i)^2. $$ Another useful tool to measure the error is the $R^2$-score, which is defined as: $$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, $$ where $SS_{res}$ is the sum of squares of the difference between the model and the data (the residual sum of squares), and $SS_{tot}$ is the total sum of squares of the data minus the mean of the data: $$ R^2 = 1 - \frac{\sum_{i=0}^{N-1}(z_i-\tilde{f}_i)^2}{\sum_{i=0}^{N-1}(z_i-\bar{z})^2}. $$
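Both error measures are one-liners in NumPy; a minimal sketch with a trivial check:

```python
import numpy as np

def mse(z, z_pred):
    """Mean-squared error between data z and prediction z_pred."""
    return np.mean((z - z_pred) ** 2)

def r2_score(z, z_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((z - z_pred) ** 2)
    ss_tot = np.sum((z - np.mean(z)) ** 2)
    return 1 - ss_res / ss_tot

z = np.array([1.0, 2.0, 3.0])
print(mse(z, z))        # 0.0 for a perfect prediction
print(r2_score(z, z))   # 1.0 for a perfect prediction
```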

Feature scaling

All scaling done in this project is of the form:
For the features:
$ x' = x-\bar{x},$
where $x'$ is the column vector minus its mean.
For the response variables:
$ z' = z - \bar{z},$
where $z'$ is the response vector minus its mean.
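A sketch of this centering, with the convention (our assumption, standard practice but not spelled out above) that the test set is shifted by the training mean rather than its own:

```python
import numpy as np

def center(train, test):
    """Column-wise centering: subtract the TRAINING mean from both sets,
    so the test data is never scaled with its own statistics."""
    mean = train.mean(axis=0)
    return train - mean, test - mean

rng = np.random.default_rng(0)
train, test = center(rng.random((8, 3)), rng.random((4, 3)))
# each training column now has mean zero
```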

Resampling techniques

The central limit theorem states that if you resample a large number of times from the same dataset, the estimates computed on these samples become normally distributed in the limit where the number of resamplings goes to infinity. This is why we resample: we want estimates for the errors in our calculations.

The Bootstrap resampling technique

The Bootstrap resampling technique simply resamples randomly, with replacement(!), from the dataset a specific number of times, say $n$, and performs our calculations for each iteration. Take the mean-squared error, for example. For each iteration of resampling, we calculate the mean-squared error. When we have looped over all iterations, we calculate the mean of the mean-squared errors: $$ \bar{MSE} = \frac{1}{n}\sum_{i=1}^{n}MSE_i. $$ Further, we can find the standard deviation of the mean-squared error: $$ \sigma_{MSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(MSE_i-\bar{MSE})^2}. $$ From this we can construct a confidence interval. We must keep in mind, though, that we are not finding the true normal distribution, as we are not resampling an infinite number of times, but we do get estimates of the confidence interval, the mean and its variance. This generalizes to any property of the dataset we would like to calculate.
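The procedure can be sketched as below, here with OLS as the fitting method; the data, seed, and split are our own illustrative choices:

```python
import numpy as np

def bootstrap_mse(X_tr, z_tr, X_te, z_te, n_boot, rng):
    """Mean and standard deviation of the test MSE over n_boot
    bootstrap resamples (drawn WITH replacement), fitted with OLS."""
    mses = np.empty(n_boot)
    n = z_tr.size
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                  # resample with replacement
        beta = np.linalg.pinv(X_tr[idx]) @ z_tr[idx]
        mses[b] = np.mean((z_te - X_te @ beta) ** 2)
    return mses.mean(), mses.std()

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
z = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)
mse_mean, mse_std = bootstrap_mse(X[:40], z[:40], X[40:], z[40:], 100, rng)
```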

k-fold cross-validation technique

The k-fold cross-validation technique splits the dataset into k folds. Each fold gets to be the test set once, and is thus part of the training set k-1 times. For each fold, we calculate the property we want to examine, and after all folds have had their run, we take the mean of these calculations. This is done to validate our calculations.
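A minimal sketch of k-fold cross-validation with OLS as the fitting method (the data and seed below are hypothetical):

```python
import numpy as np

def kfold_mse(X, z, k, rng):
    """Plain k-fold cross-validation with OLS: every fold is the test
    set exactly once; returns the mean of the k test MSEs."""
    folds = np.array_split(rng.permutation(z.size), k)   # handles leftovers
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = np.linalg.pinv(X[train]) @ z[train]
        scores.append(np.mean((z[test] - X[test] @ beta) ** 2))
    return np.mean(scores)

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 30)
X = np.column_stack([np.ones_like(x), x])
cv_mse = kfold_mse(X, 1.0 + 2.0 * x, k=5, rng=rng)   # ~0 on noise-free data
```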

Results from exercise 4 and 5

Comparison of MSE and R2-score

Ridge regression on dataset created through Franke function

In the figure referenced below, a contour map of the MSE against $\lambda$ and polynomial degree is plotted. The lambdas are $\lambda=10^i$ for all integers $i$ from -15 to 5. The polynomial degree, on the other axis, ranges from 0 to 19. The number of datapoints in the meshgrid is 1024. The figure can be found in the GitHub repository linked in the introduction.

The optimal parameters were found to be $\lambda_{optimal}=10^{-5}$ and a polynomial degree of 15. The 95% confidence interval of the minimum of $MSE_{Ridge}$ is $$ MSE_{Ridge} = 8.83\cdot 10^{-3} \pm 0.93\cdot 10^{-3}, $$ and the best R2-score: $$ R2_{Ridge} = 0.878 \pm 0.007. $$

Lasso regression on dataset created through Franke function

There is a contour plot in the GitHub repository linked in the introduction. Using exactly the same procedure as for Ridge regression, the optimal parameters $\lambda$ and polynomial degree were found. The optimal parameters were $\lambda=10^{-3}$ and a polynomial degree of 16. They yielded the 95% confidence interval for the MSE:
$$ MSE_{LASSO} = 9.269\cdot 10^{-2} \pm 1.5\cdot 10^{-4}, $$ and the R2-score wasn't recorded in time.

Results and discussion of exercise 6

Data preparation

The terrain data used in exercise 6 is one of the example terrains from Norway: SRTM_data_Norway_1.tif. The data had to be cut significantly for us to be able to process it in a reasonable time: the terrain has been reduced from a 3601x1801 matrix to a 30x30 matrix, giving us 900 datapoints to work with.
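The reduction can be sketched as strided index selection over the full grid. Reading the actual file would be done with an image reader such as `imageio.imread` (our assumption about the loader); here a synthetic array of the same shape stands in so the sketch is self-contained:

```python
import numpy as np

# Stand-in for imageio.imread("SRTM_data_Norway_1.tif"): same 3601x1801 shape
terrain = np.arange(3601 * 1801, dtype=float).reshape(3601, 1801)

# Keep a 30x30 subset by striding evenly through the rows and columns
rows = np.linspace(0, terrain.shape[0] - 1, 30).astype(int)
cols = np.linspace(0, terrain.shape[1] - 1, 30).astype(int)
terrain_small = terrain[np.ix_(rows, cols)]   # 900 datapoints to work with
```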

Results

OLS

When studying the OLS MSE against the degree of the polynomial fit, using the bootstrap resampling technique with 100 iterations of resampling, we found the minimum MSE when fitting the data to a polynomial of degree 17. The magnitude of the MSE was 6.4, but the variance was large. The confidence interval of the MSE using 100 bootstrap resamplings is therefore $$ MSE_{OLS} = 6.4 \pm 13.1. $$ The R2-score was calculated to be 0.992, but its variance wasn't calculated. The reason we didn't increase the number of resampling iterations is the enormous amount of time it would take to run the LASSO fit with that number of iterations, and we would like the methods to be comparable.

Ridge

When studying Ridge, we first wanted to find its optimal parameters. We therefore ran a bootstrap resample with 100 iterations, calculating MSE and R2 for each iteration, for all $\lambda=10^i$ with integer $i$ in the range $[-15, 5]$, and for each value of $\lambda$ we checked all polynomial degrees in the range $[0, 19]$. We found the lowest MSE at $\lambda=10^{-8}$ and a polynomial degree of 19. Here we found that the 95% confidence interval of the MSE value was: $$ MSE_{Ridge} = 16.1 \pm 2.3, $$ and the 95% confidence interval of the R2-score: $$ R2_{Ridge} = 0.9830 \pm 0.0025. $$ The contour plot of the MSE using Ridge regression against the parameters $\lambda$ and polynomial degree is shown in the figure named BestlambdadegreeRidgeTERRAIN_N_30.png, in the GitHub repo.

Lasso

For Lasso, we used the same procedure as for Ridge, with the same interval for $\lambda$ and the same range for the polynomial degree. The contour plot is shown in figure {ref}. The lowest MSE was found at $\lambda=10^{-}$ and a polynomial degree of . Here we found the 95% confidence interval of the MSE to be: $$ MSE_{LASSO} = 22.0 \pm 1.4. $$ The R2-score was calculated to be 0.9720, with unknown standard deviation, but it is probably low since the standard deviation of the MSE is so small compared to its mean. The contour plot of the MSE using Lasso regression against the parameters $\lambda$ and polynomial degree is shown in the figure named BestlambdadegreeLassoTerrain_N_30.png, in the GitHub repo.

Importing needed libraries and functions

Useful functions

EXERCISE 1

Ordinary Least Square (OLS) on the Franke function.

Finding which polynomial fit to use

Here OLS is implemented for 40 data points, with and without noise, up to a 5th degree polynomial. As shown below, due to the lack of a spike in the MSE, the model does not overfit, since the number of data points is much larger than the maximum degree. Overfitting may be observed in the following scenarios:

1) Decreasing the number of data points

2) Increasing the max degree

We tried the second one and increased the max degree of polynomial.

Overfitting:

Overfitting transpires when the model makes an almost perfect fit to your training data, fitting almost every point, so that the model can only predict the training data perfectly and can no longer predict a new set of data. Basically, your model is supposed to grasp the general pattern of the data so it can predict a new batch of data with the same pattern, but overfitting means getting lost in the details of the training data.

To prevent overfitting we can watch the MSE of the test data. Generally, a larger polynomial degree results in a better MSE because the model tries to fit the data points better; the model goes out of its way to fit the points. As we can see, for degree = 0 the fit is a constant, the simplest model (underfitting). The more complex the model, the better the MSE, but we have to look out for overfitting.

We calculate the MSE for train and test data for each polynomial degree and plot it to investigate. It is clear in the plots that after a certain polynomial degree, the test MSE starts to diverge from the train MSE and gets worse, while the train MSE gets better and better. So the minimum test MSE before overfitting actually gives the best model. It can also be observed with the $R^{2}$ score, as it goes to 1 with higher polynomial degree.

Plots

The plots below fall in two groups: one contains fits up to fifth order polynomial, which leads to no visible overfitting, and the other contains fits up to 42nd order. We show the scaling without noise up to fifth order polynomial, and due to the very slight or no difference between scaled and non-scaled data we decided to make the following plots.

Without noise: Scaled and non-scaled up to 5th order polynomial

With noise: Scaled and non-scaled up to 42nd order polynomial

1.1 Without noise: Scaled and non-scaled up to 5th order polynomial

1.1.1 Scaled up to 5th order polynomial

Here we can see that without noise the scaling has no effect on MSE or $R^{2}$.

1.1.2 Not scaled up to 5th order polynomial

1.2 With noise: Scaled and non-scaled up to 42nd order polynomial

1.2.1 Scaled up to 42nd order polynomial

To see the overfitting, we decided to increase the max polynomial degree without changing the number of datapoints, to be able to compare with the plot above. The results clearly indicate overfitting after polynomial degree 20. The large MSE occurring after degree 20 correlates very well with the $R^{2}$, which shows a smaller score after degree 20. Additionally, the smallest MSE and the highest $R^{2}$ score occur at exactly the same degree in this particular case.

1.2.2 Not scaled up to 42nd order polynomial

We compare below the effect of scaling on the noisy data. Adding the noise leads to an uneven plot of the MSE for training and test data, as well as possible outliers; after scaling it can be seen that scaling achieves smoother curves but has no effect on the minimum value of the MSE.

In this case scaling should help: it reduces the effect of outliers and helps our model fit better. Different scalers use different methods; here we use the standard scaler, which subtracts the mean of each feature from each element of that feature in our design matrix. This generally produces data in a smaller interval. Subtracting the mean value of a particular feature scales down possible outliers or large values, making the data represent the feature better.

As seen in the results, the minimum value of the MSE and the maximum value of $R^{2}$ are the same before and after scaling, but the optimal polynomial order on the scaled data is much smaller, occurring at the 9th order, while the non-scaled data yields the 20th. This lets a simpler, cheaper-to-train model achieve the same error.

In the figures we can see that $R^{2}$ becomes more negative with increased noise, and the MSE becomes larger.

2. Make own code and compare with SciKit

Solving the system using ordinary least squares

3. Calculating confidence intervals in beta values using the best polynomial (n=5, for N=100)

The same process for OLS is repeated above, except with n = 5 and N = 100, using the analytical expression for the variance and calculating the 95% confidence interval for the beta parameters. A 95% confidence interval means that one can be 95% confident that the true value of beta lies within the interval. The general trend is that beta parameters with index higher than three show larger confidence intervals; larger beta parameters are often associated with higher variance, which increases the confidence interval.

Confidence Intervals of Beta

Beta index Confidence interval[95%]
0 [-0.32, 0.55]
1 [6.42, 16.31]
2 [0.94, 9.00]
3 [-69.73, -23.14]
4 [-40.53, -5.88]
5 [-30.72, 6.02]
6 [10.18, 114.78]
7 [28.37, 105.62]
8 [-4.35, 63.35]
9 [-42.41, 39.40]
10 [-81.89, 28.36]
11 [-120.39, -38.35]
12 [-53.92, 20.97]
13 [-73.08, -2.49]
14 [-18.90, 66.81]
15 [-22.72, 20.96]
16 [11.75, 48.06]
17 [-5.50, 32.43]
18 [-19.12, 13.85]
19 [4.07, 35.50]
20 [-31.92, 1.93]

EXERCISE 2

1. HASTIE

The analytical expression for OLS is used and it yields a plot somewhat similar to Hastie's. The results show that we expect to see the possible bias-variance optimum at around the 38th polynomial degree, if we are to observe overfitting. The same number of data points and maximum degree unfortunately cannot be applied for the bias-variance analysis due to improper results (the reason is attributed to the number of data points and is discussed later).

2. Bias-Variance tradeoff

Theory behind Bias-Variance tradeoff

Bias-variance analysis utilizes the bootstrap method to calculate the MSE for training and test data, allowing us to make a reliable prediction of the test MSE. The bias measures the deviation of the mean of the predicted values from the corresponding data points, and therefore measures the degree of dissimilarity in behaviour between model and data. Because the model is usually too simple to fit the data properly in the beginning, generating a large difference between model and data, high bias is expected in this region; as the model complexity increases, the bias is expected to decrease as the model fits the training data better. The variance term represents the variance of the model itself (the variance of the predicted values) and increases with model complexity, during which overfitting starts to occur. In a bias-variance analysis, the ideal scenario is a balance between bias and variance (low bias and low variance), which determines the optimal degree. Three other situations may also appear, all of which are non-ideal: 1) high bias and low variance; 2) low bias and high variance; and 3) high bias and high variance. Lastly, the error term is the normally distributed noise in the model.

Rewriting the cost function
$$ C(\textbf{X},\beta) = \frac{1}{n}\sum_{i=0}^{n-1}(y_{i}-\widetilde{y}_{i})^{2} = \mathbb{E}\left[(\textbf{y}-\widetilde{\textbf{y}})^{2}\right] = \mathbb{E}\left[(f+\epsilon-\widetilde{\textbf{y}})^{2}\right], $$ where we have inserted $\textbf{y} = f+\epsilon$. Adding and subtracting $\mathbb{E}[\widetilde{\textbf{y}}]$: $$ C(\textbf{X},\beta) = \mathbb{E}\left[\left( [f-\mathbb{E}[\widetilde{\textbf{y}}]] - [\widetilde{\textbf{y}}-\mathbb{E}[\widetilde{\textbf{y}}]] + \epsilon \right)^{2}\right]. $$ Expanding the square, the cross terms vanish: the terms proportional to $\mathbb{E}[\epsilon]$ are zero since the noise has zero mean, and $\mathbb{E}\left[\widetilde{\textbf{y}}-\mathbb{E}[\widetilde{\textbf{y}}]\right] = \mathbb{E}[\widetilde{\textbf{y}}]-\mathbb{E}[\widetilde{\textbf{y}}] = 0$. With $\mathbb{E}[\epsilon^{2}]=\sigma^{2}$ we are left with $$ C(\textbf{X},\beta) = \underbrace{\frac{1}{n}\sum_{i}(f_{i}-\mathbb{E}[\widetilde{\textbf{y}}])^{2}}_{bias} + \underbrace{\frac{1}{n}\sum_{i}(\widetilde{y}_{i}-\mathbb{E}[\widetilde{\textbf{y}}])^{2}}_{variance} + \underbrace{\sigma^{2}}_{error}. $$
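The decomposition can be checked numerically: for bootstrapped predictions evaluated against noise-free targets, the sample MSE splits exactly into the bias and variance terms (the $\sigma^2$ term is absent since the targets carry no noise). The model, data, and seed below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
f = np.sin(2 * np.pi * x)                      # hypothetical true function
X = np.column_stack([x**d for d in range(4)])  # cubic polynomial model

n_boot = 200
preds = np.empty((n_boot, x.size))
for b in range(n_boot):
    idx = rng.integers(0, x.size, x.size)      # bootstrap resample
    z = f[idx] + 0.1 * rng.standard_normal(x.size)
    beta = np.linalg.pinv(X[idx]) @ z
    preds[b] = X @ beta

error = np.mean((f - preds) ** 2)              # MSE against noise-free f
bias2 = np.mean((f - preds.mean(axis=0)) ** 2) # squared bias term
variance = np.mean(preds.var(axis=0))          # variance term
# error == bias2 + variance (up to floating point)
```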
Computing the BV-tradeoff against degree of polynomial fit

With fewer than 50 datapoints, the bias suddenly increases after a certain degree, at around the same point as the variance and the test MSE. The test MSE decreases again after the sudden increase, making it difficult to spot the overfitting, which coincides with the variance and bias. The regions where the bias, variance and error reach a maximum indicate high bias and high variance, the least desirable model.

Above 50 datapoints the bias appears normal: it decreases with complexity, as does the error, though it indicates no overfitting due to the high number of data points. The variance, however, stays large for all polynomial degrees. Overall, the situation is low bias and high variance.

In our case the optimal number of data points is 50, but it can change with a different number of bootstraps (ideally, the number of bootstraps should not exceed the number of data points) and different max degrees.

2.1 Optimal bias-variance analysis:

2.2 N<50 bias-variance analysis:

2.3 N>50 bias-variance analysis:

3. Hastie vs Bias-Variance tradeoff

OLS yields an optimal degree of 38, which is not far from the one estimated using the bias-variance analysis, 43. The Hastie figure appears to be unreliable, since running with different numbers of data points and max degrees gave us different, inconsistent answers, but also because the bias-variance analysis applies the bootstrap method a number of times in order to determine a more reliable model.

EXERCISE 3

Cross-validation resampling technique

The _crossvalidationols(k) function takes in k and splits the indices of the data points into k units. It also makes sure to split the data such that any odd data points left over become part of the correct units. i counts the number of times the for loop has run and enables inserting the estimated values into the KFold score arrays. Every time the for loop runs, it makes the kth fold the test data and assigns the test-data indices from concact_pre, which contains all the indices of the split data. It then checks the length of concat: whether it has only one fold left, or more. If there is one left, it uses the index in concat to assign this as training data; this is relevant when k = 2. The other option is when k > 2, where it concatenates the remaining folds into the whole training data.

After the if statement, it applies the generated test- and train-data indices to generate the test and training data. Thereafter, SciKit OLS and the analytical expression for OLS are applied for comparison, and the KFold scores are calculated. After the code exits the for loop, it takes the mean of the scores to yield the MSE values. Lastly, SciKit's cross-validation function is also implemented for further comparison. The results show a dramatic deviation from SciKit's function.

EXERCISE 4

Ridge regression, using own code

Bootstrap analysis of Ridge

Bootstrap BVTradeoff with lambda parameter

Cross-validation, Ridge

Comparison between own cv and sk cv

EXERCISE 5

Lasso

Bootstrap analysis of LASSO

Bootstrap BVTradeoff (Lasso) against lambda parameter.

Cross-validation Lasso

Comparison between own cv and sk cv for Lasso

Exercise 6

Dataset

Attempt at ordinary least square fit

OLS fit with different polynomial degrees

BVTradeoff OLS

Discussion of BVTRADEOFF


Ridge fit

Bias-Variance tradeoff, Ridge against polynomial degree

Ridge with Bootstraps resampling method

Cross-validation, ridge

Comparison with scikit

Lasso fit on dataset

Bootstrap sampling

Bootstrap analysis of Lasso against lambda

Cross validation Lasso